{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Manipulating Metadata\n",
    "\n",
    "One of ParquetDB’s strengths is the ability to store and manage **metadata** alongside your dataset. You can attach metadata at:\n",
    "- **Dataset level** (e.g., `version`, `source`, etc.), which applies to the entire table or dataset.\n",
    "- **Field/column level** (e.g., `units`, `description`, etc.), which applies to specific columns.\n",
    "\n",
    "In this notebook, we’ll walk through:\n",
    "1. **Updating the Schema** – how to add or change fields in the dataset schema, including updating metadata.\n",
    "2. **Setting Dataset Metadata** – how to set or update top-level metadata for the entire dataset.\n",
    "3. **Setting Field Metadata** – how to set or update metadata for individual fields (columns).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The `update_schema` method allows you to modify the structure and metadata of your dataset. You can:\n",
    "- Change the data type of an existing field.\n",
    "- Add new fields (if your workflow demands it).\n",
    "- Update the **top-level** metadata (if `update_metadata=True`).\n",
    "- Optionally normalize the dataset after making schema changes by providing a `normalize_config`.\n",
    "\n",
    "\n",
    "```python\n",
    "def update_schema(\n",
    "    self,\n",
    "    field_dict: dict = None,\n",
    "    schema: pa.Schema = None,\n",
    "    update_metadata: bool = True,\n",
    "    normalize_config: NormalizeConfig = NormalizeConfig()\n",
    "):\n",
    "    ...\n",
    "```\n",
    "- `field_dict`: A dictionary of field updates, where keys are field names and values are the new field definitions (e.g., pa.int32(), pa.float64()), or pa.field(\"field_name\", pa.int32()).\n",
    "- `schema`: A fully defined PyArrow Schema object to replace or merge with the existing one.\n",
    "- `update_metadata`: If True, merges the new schema’s metadata with existing metadata. If False, replaces the metadata entirely.\n",
    "- `normalize_config`: A NormalizeConfig object for controlling file distribution after the schema update."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "============================================================\n",
      "PARQUETDB SUMMARY\n",
      "============================================================\n",
      "Database path: my_dataset\n",
      "\n",
      "• Number of columns: 3\n",
      "• Number of rows: 3\n",
      "• Number of files: 1\n",
      "• Number of rows per file: [3]\n",
      "• Number of row groups per file: [1]\n",
      "• Serialized metadata size per file: [717] Bytes\n",
      "\n",
      "############################################################\n",
      "METADATA\n",
      "############################################################\n",
      "\n",
      "############################################################\n",
      "COLUMN DETAILS\n",
      "############################################################\n",
      "• Columns:\n",
      "    - age\n",
      "    - id\n",
      "    - name\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from parquetdb import ParquetDB\n",
    "from pathlib import Path\n",
    "import shutil\n",
    "import pyarrow as pa\n",
    "\n",
    "ROOT_DIR = Path(\".\")\n",
    "DATA_DIR = ROOT_DIR / \"data\"\n",
    "\n",
    "if DATA_DIR.exists():\n",
    "    shutil.rmtree(DATA_DIR)\n",
    "    \n",
    "db_path = ROOT_DIR / \"ParquetDB\"\n",
    "\n",
    "\n",
    "data = [\n",
    "    {\"name\": \"Alice\", \"age\": 30},\n",
    "    {\"name\": \"Bob\", \"age\": 25},\n",
    "    {\"name\": \"Charlie\", \"age\": 35},\n",
    "]\n",
    "\n",
    "db.create(data)\n",
    "print(db)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Update Schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pyarrow.Table\n",
      "age: int64\n",
      "id: int64\n",
      "name: string\n",
      "----\n",
      "age: [[30,25,35]]\n",
      "id: [[0,1,2]]\n",
      "name: [[\"Alice\",\"Bob\",\"Charlie\"]]\n",
      "pyarrow.Table\n",
      "age: double\n",
      "id: int64\n",
      "name: string\n",
      "----\n",
      "age: [[30,25,35]]\n",
      "id: [[0,1,2]]\n",
      "name: [[\"Alice\",\"Bob\",\"Charlie\"]]\n"
     ]
    }
   ],
   "source": [
    "table = db.read()\n",
    "print(table)\n",
    "\n",
    "# Suppose we want to change the 'age' field to float64\n",
    "field_updates = {\n",
    "    \"age\": pa.field(\n",
    "        \"age\", pa.float64()\n",
    "    )  # or simply pa.float64() if your internal method accepts that\n",
    "}\n",
    "\n",
    "db.update_schema(field_dict=field_updates, update_metadata=True)\n",
    "\n",
    "table = db.read()\n",
    "print(table)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Dataset Metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "============================================================\n",
      "PARQUETDB SUMMARY\n",
      "============================================================\n",
      "Database path: my_dataset\n",
      "\n",
      "• Number of columns: 3\n",
      "• Number of rows: 3\n",
      "• Number of files: 1\n",
      "• Number of rows per file: [3]\n",
      "• Number of row groups per file: [1]\n",
      "• Serialized metadata size per file: [854] Bytes\n",
      "\n",
      "############################################################\n",
      "METADATA\n",
      "############################################################\n",
      "• source: API\n",
      "• version: 1.0\n",
      "\n",
      "############################################################\n",
      "COLUMN DETAILS\n",
      "############################################################\n",
      "• Columns:\n",
      "    - age\n",
      "    - id\n",
      "    - name\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Set dataset-level metadata, merging with existing entries\n",
    "db.set_metadata({\"source\": \"API\", \"version\": \"1.0\"})\n",
    "\n",
    "print(db)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we call `set_metadata` again with additional keys:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'source': 'API', 'version': '1.0', 'author': 'Data Engineer', 'department': 'Analytics'}\n"
     ]
    }
   ],
   "source": [
    "# Add more metadata, merging with the existing ones\n",
    "db.set_metadata({\"author\": \"Data Engineer\", \"department\": \"Analytics\"})\n",
    "\n",
    "print(db.get_metadata())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to replace the existing metadata:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'source': 'API_2', 'version': '2.0'}\n"
     ]
    }
   ],
   "source": [
    "# Replace existing metadata\n",
    "db.set_metadata({\"source\": \"API_2\", \"version\": \"2.0\"}, update=False)\n",
    "\n",
    "print(db.get_metadata())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Field-Level Metadata\n",
    "\n",
    "If you want to attach descriptive information to specific fields (columns), use `set_field_metadata`. This is useful for storing **units of measurement**, **data lineage**, or other column-specific properties."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'age': {'units': 'Years', 'description': 'Age of the person'}, 'id': {}, 'name': {}}\n"
     ]
    }
   ],
   "source": [
    "field_meta = {\"age\": {\"units\": \"Years\", \"description\": \"Age of the person\"}}\n",
    "\n",
    "db.set_field_metadata(field_meta)\n",
    "\n",
    "print(db.get_field_metadata())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note**: When physically stored, metadata is typically stored in the **Parquet file footer** and read by PyArrow upon loading. If you rely on certain metadata keys in your analysis, ensure your entire workflow consistently updates and preserves them."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "parquetdb_dev",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.20"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}